Although empirical and neural studies show that serotonin (5HT) plays many functional
roles in the brain, prior computational models mostly focus on its role in behavioral 
inhibition. In this study, we present a model of risk-based decision making in a modified
Reinforcement Learning (RL) framework. The model depicts the roles of dopamine (DA)
and serotonin (5HT) in the Basal Ganglia (BG). In this model, the DA signal is represented
by the temporal difference error (δ), while the 5HT signal is represented by a parameter
(α) that controls risk prediction error. This formulation that accommodates both 5HT and
DA reconciles some of the diverse roles of 5HT particularly in connection with the BG 
system. We apply the model to different experimental paradigms used to study the 
role of 5HT: (1) Risk-sensitive decision making, where 5HT controls risk assessment, 

(2) Temporal reward prediction, where 5HT controls time-scale of reward prediction, and 

(3) Reward/Punishment sensitivity, in which the punishment prediction error depends on 
5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 
5HT and DA in the BG.

Keywords: serotonin, dopamine, basal ganglia, Reinforcement Learning, Risk, Reward, Punishment, Decision
Making



INTRODUCTION 

Monoamine neuromodulators such as dopamine, serotonin, 
norepinephrine and acetylcholine are hailed to be the most 
promising neural messengers to ensure healthy adaptation to 
our uncertain environments. Specifically, serotonin (5HT) and 
dopamine (DA) play important roles in various cognitive pro- 
cesses, including reward and punishment learning (Cools et al., 
2011; Rogers, 2011). DA signaling has been linked to reward 
processing in the brain for a long time (Bertler and Rosengren, 
1966). Furthermore, the activity of mesencephalic DA neurons
is found to closely resemble the temporal difference (TD) error in
Reinforcement Learning (RL) (Schultz, 1998). This TD error rep- 
resents the difference in the total reward (outcome) that the agent 
or subject receives at a given state and time, and the total pre- 
dicted reward. The semblance between the TD error signal and 
DA signal served as a starting point of an extensive theoretical 
and experimental effort to apply concepts of RL to understand the 
functions of the Basal Ganglia (BG) (Schultz et al., 1997; Sutton 
and Barto, 1998; Joel et al., 2002; Chakravarthy et al., 2010). This
led to the emergence of a framework for understanding the BG 
functions in which the DA signal played a crucial role. Deficiency 
of such a neuromodulator (DA) leads to symptoms observed in 
neurodegenerative disorders like Parkinson's Disease (Bertler and 
Rosengren, 1966; Goetz et al., 2001). 



THE MULTIPLE FUNCTIONS OF SEROTONIN 

It is well-known that dopamine is not the only neuromodulator 
that is associated with the BG function. Serotonin (5HT) pro- 
jections to the BG are also known to have an important role 
in decision making (Rogers, 2011). 5HT is an ancient molecule 
that existed even in plants (Angiolillo and Vanderkooi, 1996). 
Through its precursor tryptophan, 5HT is linked to some of the 
fundamental processes of life itself. Tryptophan-based molecules 
in plants are crucial for capturing the light energy necessary 
for glucose metabolism and oxygen production (Angiolillo and 
Vanderkooi, 1996). Thus, by virtue of its fundamental role in 
energy conversion, 5HT is integral to mitosis, maturation, and 
apoptosis. In lower organisms, it modulates the feeding behav- 
ior and other social behaviors such as dominance posture, and 
escape responses (Kravitz, 2000; Azmitia, 2001; Chao et al., 2004). 
Due to its extended role as a homeostatic regulator in higher 
animals and in mammals, 5HT is also associated with appetite 
suppression (Azmitia, 1999; Halford et al., 2005; Gillette, 2006).
Furthermore, 5HT plays important roles in anxiety, depression, 
inhibition, hallucination, attention, fatigue, and mood (Tops 
et al., 2009; Cools et al., 2011). Increasing 5HT level leads to 
decreasing punishment prediction, though recent evidence point- 
ing to the role of DA in processing aversive stimuli makes the 
picture more complicated (So et al., 2009; Boureau and Dayan,
2011). The tendency to pay more attention to negative than
positive experiences or other kinds of information (negative cognitive
bias) is found to occur at lower levels of 5HT (Cools
et al., 2008; Robinson et al., 2012). 5HT is also known to con- 
trol the time scale of reward prediction (Tanaka et al., 2007) 
and to play a role in risk sensitive behavior (Long et al., 2009; 
Murphy et al., 2009; Rogers, 2011). Studies found that under 
conditions of tryptophan depletion, which is known to reduce 
the brain 5HT level, risky choices are preferred to safer ones 
in decision making tasks (Long et al., 2009; Murphy et al., 
2009; Rogers, 2011). Reports about the 5HT transporter gene influencing
risk-based decision making also exist (He et al., 2010;
Kuhnen et al., 2013). 5HT is known to influence non-linearity
in risk-based decision making (Kahneman and Tversky, 1979):
risk-aversivity in the case of gains and risk-seeking during losses,
when presented with choices of equal means (Murphy et al., 2009;
Zhong et al., 2009a,b). In summary, 5HT is not only impor- 
tant for behavioral inhibition, but is also related to time scales 
of reward prediction, risk, anxiety, attention etc., and to non- 
cognitive functions like energy conversion, apoptosis, feeding, 
and fatigue. 

PRIOR THEORETICAL AND COMPUTATIONAL ABSTRACT MODELS OF 
SEROTONIN 

It would be interesting to understand and reconcile the roles 
of DA and 5HT in the BG. Prior abstract models address- 
ing the same quest such as that by Daw et al. (2002) argue 
that DA signaling plays a role that is complementary to 5HT. 
It has been suggested that whereas the DA signal responds to 
appetitive stimuli, 5HT responds to aversive or punitive stimuli
(Daw et al., 2002). Unlike computational models that argue
for complementary roles of DA and 5HT, empirical studies
show that both neuromodulators play cardinal roles in coding
the signals associated with the reward (Tops et al., 2009;
Cools et al., 2011; Rogers, 2011). Genes that control neurotransmission
of both molecules are known to affect processing
of both rewarding and aversive stimuli (Cools et al., 2011). 
Complex interactions between DA and 5HT make it difficult to 
tease apart precisely the relative roles of the two molecules in 
reward evaluation. Some subtypes of 5HT receptors facilitate DA 
release from the midbrain DA releasing sites, while others inhibit 
them (Alex and Pehek, 2007). In summary, it is clear that the 
relationship between DA and 5HT is not one of simple com- 
plementarity. Both synergistic and opposing interactions exist 
between these two molecules in the brain (Boureau and Dayan, 
2011). 

Efforts have been made to elucidate the function of 5HT 
through abstract modeling. Daw et al. (2002) developed a line 
of modeling that explores an opponent relationship (Daw et al., 
2002; Dayan and Huys, 2008) between DA and 5HT. In an 
attempt to embed all the four key neuromodulators (DA, 5HT,
norepinephrine and acetylcholine) within the framework of RL,
Doya (2002) associated 5HT with the discount factor, γ, which is a
measure of time-scale of reward integration (Doya, 2002; Tanaka 
et al., 2007). There is no single computational theory that inte- 
grates and reconciles the existing computational perspectives of 
5HT function in a single framework. 



OUR MODEL IN BRIEF 

In this modeling study, we present a model of both 5HT and
DA in the BG, simulated using a modified RL framework. Here,
DA represents the TD error, as in most of the extant literature on DA
signaling and RL (Schultz et al., 1997; Sutton and Barto, 1998),
and 5HT controls the risk prediction error. Action selection is controlled
by a utility function that is a weighted combination
of the value and risk functions (Bell, 1995; Preuschoff
et al., 2006; D'acremont et al., 2009). In the proposed modified
formulation of the utility function, the weight of the risk
function depends on the sign of the value function and a trade-off
parameter α, which we describe in detail below. Just as the
value function was thought to be computed in the striatum,
we now propose that the utility function is computed in the
striatum.

The outline of the paper is as follows: Section Methods 
describes the model equations. In Section Results, we show that 
a combination of both value and the risk function for decision 
making explains the following experiments. The first of these 
pertains to risk sensitivity in bee foraging (Real, 1981). Here we 
demonstrate that the proposed 5HT and DA model can simu- 
late this simple neurobiological instance of risk-based decision 
making. We then show the capability of the model to explain the 
roles of 5HT in the representative experimental conditions: risk 
sensitivity in Tryptophan depleted conditions (Long et al., 2009); 
time-scale of reward prediction (Tanaka et al., 2007); and reward 
and punishment sensitivity (Cools et al., 2008). We present 
the discussion on the model and results in Section Discussion. 
Furthermore in the discussion, we hypothesize that the plausi- 
ble neural correlates for the risk component are the D1R and the 
D2R co-expressing medium spiny neurons of the striatum, with 
serotonin selectively modulating this population of neurons. 

METHODS 

Along the lines of the utility models described by Bell (1995) and
D'acremont et al. (2009), we present here the utility function,
U_t, as a tradeoff between the expected payoff and the variance
of the payoff (the subscript "t" refers to time). The original
utility formulation used in Bell (1995) and D'acremont et al. (2009)
is given by Equation 2.1.

$$U_t(s, a) = Q_t(s, a) - \kappa \sqrt{h_t(s, a)} \qquad (2.1)$$

where Q_t is the expected cumulative reward and h_t is the risk
function or reward variance, for state, s, and action, a; κ is the risk
preference. Note that in Equation 2.1, we represent the state and
action explicitly, as opposed to Bell (1995) and D'acremont et al.
(2009).

In classical RL (Sutton and Barto, 1998) terms, following policy,
π, the action value function, Q, at time t of a state, "s," and
action, "a," may be expressed as (Equation 2.2).

$$Q^{\pi}(s, a) = E_{\pi}\left(r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s_t = s, a_t = a\right) \qquad (2.2)$$

where r_t is the reward obtained at time, t, and γ is the discount
factor (0 < γ < 1). E_π denotes the expectation when action
selection is done with policy π. The incremental update for the
action value function, Q, is defined as in Equation 2.3.
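
As an illustration of the discounted sum inside the expectation of Equation 2.2, the following minimal Python sketch evaluates it for one finite reward sequence. The function name `discounted_return` and the example rewards are illustrative assumptions and are not part of the model.

```python
def discounted_return(rewards, gamma):
    """Return r_{t+1} + gamma * r_{t+2} + gamma**2 * r_{t+3} + ... for a finite sequence.

    `rewards` lists r_{t+1}, r_{t+2}, ... in order; `gamma` is the discount factor.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


# Hypothetical example: three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The action value Q in Equation 2.2 is the average of such sums over the reward sequences generated by the policy.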

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \eta_Q \delta_t \qquad (2.3)$$

where s_t is the state at time, t; a_t is the action performed at time, t;
and η_Q is the learning rate of the action value function (0 < η_Q <
1). δ_t is the TD error defined by Equation 2.4,

$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \qquad (2.4)$$

In the case of immediate reward problems, δ_t is defined by
Equation 2.5.

$$\delta_t = r_t - Q_t(s_t, a_t) \qquad (2.5)$$

Similar to the value function, the risk function "h_t" has an
incremental update as defined by Equation 2.6.

$$h_{t+1}(s_t, a_t) = h_t(s_t, a_t) + \eta_h \xi_t \qquad (2.6)$$

where η_h is the learning rate of the risk function (0 < η_h < 1),
and ξ_t is the risk prediction error expressed by Equation 2.7,

$$\xi_t = \delta_t^{2} - h_t(s_t, a_t) \qquad (2.7)$$

η_h and η_Q are set to 0.1, and Q_t and h_t are set to zero
at t = 0 for the simulations of sections Risk Sensitivity and
Rapid Tryptophan Depletion, Time Scale of Reward Prediction
and Serotonin, and Reward/Punishment Prediction Learning and
Serotonin described below.
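
The value and risk updates can be sketched in a few lines of Python for the immediate reward case. This is a minimal illustration rather than the simulation code used in this study; the function name `update_value_and_risk` is a hypothetical choice, and both prediction errors are computed from the estimates held before the update.

```python
def update_value_and_risk(Q, h, r, eta_Q=0.1, eta_h=0.1):
    """One immediate-reward update of the value Q and risk h for a single (state, action) pair.

    Both prediction errors are computed from the current estimates before either is changed.
    """
    delta = r - Q                 # TD error for immediate reward (Equation 2.5)
    xi = delta ** 2 - h           # risk prediction error (Equation 2.7)
    Q_new = Q + eta_Q * delta     # value update (Equation 2.3)
    h_new = h + eta_h * xi        # risk update (Equation 2.6)
    return Q_new, h_new, delta, xi


# Starting from Q = h = 0 with a reward of 1 and both learning rates at 0.1.
Q1, h1, delta1, xi1 = update_value_and_risk(Q=0.0, h=0.0, r=1.0)
```

Because h is updated toward the squared TD error, it tracks the spread of rewards around the current value estimate, and it stays non-negative when initialized at zero with a learning rate between 0 and 1.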

We now present a modified form of the utility function by
substituting κ = α sign[Q_t(s_t, a_t)] in Equation 2.1.

$$U_t(s_t, a_t) = Q_t(s_t, a_t) - \alpha \, \mathrm{sign}(Q_t(s_t, a_t)) \sqrt{h_t(s_t, a_t)} \qquad (2.8)$$

In Equation 2.8, the risk preference includes three
components: the "α" term, the "sign(Q_t)" term, and the risk
term √h_t. The sign(Q_t) term achieves a familiar feature of human
decision making viz., risk-aversion for gains and risk-seeking
for losses (Kahneman and Tversky, 1979). In other words, when
sign(Q_t) is positive (negative), U_t is maximized by
minimizing (maximizing) risk. Note that the expected action
value Q_t would be positive for gains that earn rewards greater
than a reward base (= 0), and would be negative otherwise, during
losses. We associate the 5HT level with α, a constant that controls the
relative weightage between action value and risk (Equation 2.8).
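
A small Python sketch of Equation 2.8 makes the effect of the sign term concrete. The helper names and the numerical values are hypothetical, and taking sign(0) = 0 is an assumption of this sketch.

```python
import math


def sign(x):
    """Return -1.0, 0.0 or 1.0 according to the sign of x (sign(0) = 0 is assumed here)."""
    return float((x > 0) - (x < 0))


def utility(Q, h, alpha):
    """Modified utility of Equation 2.8: Q - alpha * sign(Q) * sqrt(h)."""
    return Q - alpha * sign(Q) * math.sqrt(h)


# Hypothetical values: the same risk (h = 4) lowers utility for a gain and raises it for a loss.
u_gain = utility(Q=2.0, h=4.0, alpha=0.5)    # 2 - 0.5 * 2 = 1.0
u_loss = utility(Q=-2.0, h=4.0, alpha=0.5)   # -2 + 0.5 * 2 = -1.0
```

With a positive α, risk is penalized when the expected value is a gain and rewarded when it is a loss, which is the asymmetry described above.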

In this study, action selection is performed using the softmax
distribution (Sutton and Barto, 1998) generated from the utility.
Note that traditionally the distribution generated from the action
value is used. The probability, P_t(a|s), of selecting an action, a, for
a state, s, at time, t, is given by the softmax policy (Equation 2.9).

$$P_t(a \mid s) = \frac{\exp(\beta U_t(s, a))}{\sum_{i=1}^{n} \exp(\beta U_t(s, i))} \qquad (2.9)$$

n is the total number of actions available at state, s, and β is the
inverse temperature parameter. Values of β tending to 0 make the
actions almost equiprobable and β tending to ∞ makes the
softmax action selection identical to greedy action selection.
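
The two limiting cases of β can be checked with a short Python sketch of Equation 2.9. The function name `softmax_policy` and the example utilities are illustrative assumptions; the sketch assumes β ≥ 0.

```python
import math


def softmax_policy(utilities, beta):
    """Action probabilities of Equation 2.9 for a list of utilities U_t(s, i); assumes beta >= 0."""
    # Subtracting the maximum leaves the probabilities unchanged and avoids overflow.
    u_max = max(utilities)
    weights = [math.exp(beta * (u - u_max)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical utilities for two actions.
p_flat = softmax_policy([1.0, 0.0], beta=0.0)      # equiprobable actions
p_sharp = softmax_policy([1.0, 0.0], beta=50.0)    # nearly greedy selection
```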

RESULTS 

In this section, we apply the model of 5HT and DA in BG 
(Section Methods) to explain several risk-based decision making 
phenomena pertaining to BG function. 

1) Measurement of risk sensitivity: Two experiments are simu- 
lated in this category: 

- Risk sensitivity in Bee foraging (Real, 1981) 

- Risk sensitivity and Tryptophan depletion (Long et al., 2009) 

2) Representation of time scale of reward prediction (Tanaka 
et al., 2007) and

3) Measurement of punishment sensitivity (Cools et al., 2008). 

The parameters for each experiment are optimized using a genetic
algorithm (GA) (Goldberg, 1989) (details of the GA option set
are given in the Supplementary material).

RISK SENSITIVITY IN BEE FORAGING 
Experiment summary 

In the bee foraging experiment by Real (1981), bees were allowed 
to choose between flowers of two colors — blue and yellow. Both 
types of flowers deliver the same amounts of mean reward (nec- 
tar) but differ in the reward variance. The experiment showed that 
bees prefer the less risky flowers i.e., the one with lesser variance 
in nectar (Real, 1981). 

Biogenic amines such as 5HT are found to influence foraging
behavior in bees (Schulz and Robinson, 1999; Wagener-Hulme
et al., 1999). In particular, the brain levels of dopamine, serotonin,
and octopamine are found to be high in foraging bees
(Wagener-Hulme et al., 1999). Montague et al. (1995) showed
risk aversion in bee foraging using a general predictive learning
framework without mentioning DA. They assume a special "subjective
utility," which is a non-linear reward function (Montague
et al., 1995), to account for the risk sensitivity of the subject. In the
foraging problem of Real (1981), bees choose between two flowers
that have the same mean reward but differ in risk or reward
variance. Therefore, the problem is ideally suited for a risk-based
decision making approach. We show that the task can be modeled,
without any assumptions about "subjective utility," by using
the proposed 5HT-DA model, which has an explicit representation
for risk.

Simulation 

We model the above phenomenon of bee foraging using the modified
utility function of Section Methods. This foraging problem
of Real (1981) is treated as a variation of the stochastic "two-armed
bandit" problem (Sutton and Barto, 1998), possessing no
state (s) and 2 actions (a). We represent the colors of the flowers
("yellow" and "blue"), which happen to be the only predictors of
nectar delivery, as the two arms (viz. the two actions, a). The initial series
of experimental trials is modeled to have all the blue flowers ("no-risk"
choice) delivering 1 μl (reward value r = 1) of nectar, 1/3 of
the yellow flowers delivering 3 μl (r = 3), and the remaining 2/3
of the yellow flowers containing no nectar at all (r = 0) (yellow flowers
= "risky" choice). These contingencies are reversed at trial 15
and stay that way till trial 40. Since the task here requires only
a single decision per trial, we model it as an immediate reward
problem (Equation 2.5). Hence the δ for any trial t is calculated
as in Equation 3.1.2.1 for updating the respective action value by
Equation 3.1.2.2.

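$$\delta_t = r_t - Q_t(a_t), \quad a_t \in \{\text{blue flower}, \text{yellow flower}\} \qquad (3.1.2.1)$$

$$Q_{t+1}(a_t) = Q_t(a_t) + \eta_Q \delta_t \qquad (3.1.2.2)$$

$$h_{t+1}(a_t) = h_t(a_t) + \eta_h \xi_t \qquad (3.1.2.3)$$

$$\xi_t = \delta_t^{2} - h_t(a_t) \qquad (3.1.2.4)$$

$$U_t(a_t) = Q_t(a_t) - \alpha \, \mathrm{sign}(Q_t(a_t)) \sqrt{h_t(a_t)} \qquad (3.1.2.5)$$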

In our simulation, the expected action value (given by Q) for
both the flowers converges to the same value (= 1). Our model
accounts for the risk through the variance component (represented by "h" of
each flower: Equations 3.1.2.3, 3.1.2.4) in the utility
function (Equation 3.1.2.5), which plays a key role in the action
selection.

Results 

In the experiment (Real, 1981), most of the bees visited the constant
nectar yielding blue flowers initially, i.e., they chose a risk-free
strategy, but later the choice switched to the yellow flowers, once 
the yellow became the less risky choice. We observe the same 
in our simulations too. Risk-aversive behavior being an optimal 
approach during the positive rewarding scenario, the blue flow- 
ers that deliver a steady reward of 1 have higher utility and are 
preferred over the more variable yellow flowers initially. The sit- 
uation is reversed after trial 15 when the blue flowers suddenly 
become risky and the yellow ones become risk-free. Here, the util- 
ity of the yellow flowers starts increasing, as expected. Note that 
the expected action value for both flowers still remains the same, 
though the utility has changed. 

With η_h = 0.051, η_Q = 0.001, α = 1.5 in Equation 3.1.2.5,
and β = 10 in Equation 2.9 for the simulation, the proposed
model captures the shift in selection in less than 5 trials from
the indication of the contingency reversal (red line in
Figure 1). Since the value is always non-negative, and α > 0, our
model exhibits risk-averse behavior, similar to the bees in the
study.
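
The bee foraging simulation can be sketched in Python as below. This is a hedged reconstruction from Equations 3.1.2.1 to 3.1.2.5 and the parameter values quoted above, not the code used to produce Figure 1. The function names, the number of simulated bees, the seeding scheme, and the choice to treat trials 1 to 15 as the pre-reversal phase are all assumptions of this sketch.

```python
import math
import random


def sign(x):
    """Return -1.0, 0.0 or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def sample_nectar(flower, risky_flower, rng):
    """The constant flower pays 1; the risky flower pays 3 with probability 1/3, else 0."""
    if flower != risky_flower:
        return 1.0
    return 3.0 if rng.random() < 1.0 / 3.0 else 0.0


def run_bee(n_trials=40, reversal=15, eta_Q=0.001, eta_h=0.051,
            alpha=1.5, beta=10.0, seed=0):
    """Simulate one bee and return its list of choices (0 = blue, 1 = yellow)."""
    rng = random.Random(seed)
    Q = [0.0, 0.0]   # action values for blue, yellow
    h = [0.0, 0.0]   # risk estimates for blue, yellow
    choices = []
    for t in range(1, n_trials + 1):
        # Assumption: yellow is the risky flower up to the reversal trial, blue afterwards.
        risky = 1 if t <= reversal else 0
        U = [Q[a] - alpha * sign(Q[a]) * math.sqrt(h[a]) for a in (0, 1)]  # Eq. 3.1.2.5
        # Two-action softmax (Eq. 2.9) written as a logistic of the utility difference.
        p_blue = 1.0 / (1.0 + math.exp(-beta * (U[0] - U[1])))
        a = 0 if rng.random() < p_blue else 1
        r = sample_nectar(a, risky, rng)
        delta = r - Q[a]          # Eq. 3.1.2.1
        xi = delta ** 2 - h[a]    # Eq. 3.1.2.4
        Q[a] += eta_Q * delta     # Eq. 3.1.2.2
        h[a] += eta_h * xi        # Eq. 3.1.2.3
        choices.append(a)
    return choices


# Fraction of blue choices on each trial, averaged over independent simulated bees.
n_bees = 200
runs = [run_bee(seed=s) for s in range(n_bees)]
p_blue_per_trial = [sum(1 for run in runs if run[t] == 0) / n_bees for t in range(40)]
```

Plotting `p_blue_per_trial` against the trial number gives a curve of the kind shown in Figure 1; the exact shape depends on the assumptions listed above.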

RISK SENSITIVITY AND RAPID TRYPTOPHAN DEPLETION 

Experiment summary 

Now we show that the above 5HT-DA model framework for risk-based
decision making can also explain the Long et al. (2009)
experiment on risk sensitivity under conditions of Tryptophan
depletion. Their experiment required the monkey to saccade to
one of two given targets. One target was associated with a guaranteed
juice reward (safe) and the other with a variable juice
volume (risky). The monkeys showed a non-linear risk sensitivity
toward juice rewards, adopting risk-seeking behavior for small juice
rewards and risk-aversive behavior for the larger ones (Long et al., 2009).
Long et al. (2009) showed that when brain
5HT levels were reduced by Rapid Tryptophan Depletion (RTD),
monkeys preferred risky over safer alternatives.
Tryptophan acts as a precursor to 5HT and therefore reduction in
tryptophan causes reduction in 5HT.

FIGURE 1 | Selection of the blue flowers in our simulation (Sims), averaged
over 1000 instances, and in the experiment (Expt) adapted from Real (1981).
The red line indicates the contingency reversal.

Simulation 

The juice rewards r^j, represented in Long et al. (2009) as the open
time of the solenoid used to control the juice flow to the mouth
of the monkeys, are given in Table 1. The non-linearity in risk
attitudes exhibited by the monkeys is accounted for in the model
by considering a reward base (r^b) that is subtracted from the juice
reward (r^j) obtained. The resultant subjective reward (r) is treated
as the actual immediate reward received by the agent (Equation
3.2.2.1). Subtracting r^b from r^j associates any r^j < r^b with an
effect similar to economic losses, and any r^j > r^b with gains.

$$r = r^{j} - r^{b} \qquad (3.2.2.1)$$

The reward base (r^b) used in the experiment is 193.2. A
separate utility function, U_t, is computed using Equation 2.8
for each pair of state, s, tabulated in Table 1, and action choice, a
(a ∈ {safe target, risky target}). This is also modeled as an
immediate reward problem and the subjective reward given by
Equation 3.2.2.1 is used for the respective (state, action) pair's
TD error calculation (Equation 2.5). The action value function
is updated over trials using Equation 2.3 and the risk function
is updated using Equation 2.6 for any (state, action) pair described
above.
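
The effect of the reward base on the schedule of Table 1 can be illustrated with a short Python sketch of Equation 3.2.2.1. The names `SCHEDULE`, `REWARD_BASE`, and `subjective_reward` are hypothetical, and the sketch only computes mean subjective rewards; it does not run the learning simulation.

```python
# Reward schedule of Table 1: state -> (safe reward, (risky reward 1, risky reward 2)), in ms.
SCHEDULE = {
    1: (150, (125, 175)),
    2: (150, (100, 200)),
    3: (150, (50, 250)),
    4: (140, (40, 240)),
    5: (200, (40, 240)),
    6: (210, (40, 240)),
}
REWARD_BASE = 193.2  # r^b


def subjective_reward(juice_reward, reward_base=REWARD_BASE):
    """Equation 3.2.2.1: r = r^j - r^b."""
    return juice_reward - reward_base


# Mean subjective reward of (safe, risky) per state; a negative value is treated like a loss.
mean_subjective = {
    s: (subjective_reward(safe), subjective_reward(sum(risky) / 2.0))
    for s, (safe, risky) in SCHEDULE.items()
}
```

With this reward base, the safe target has a positive subjective reward only in states 5 and 6, while the mean subjective reward of the risky target is negative in every state.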

Results 

Here we examine the following conditions: (1) overall choice,
(2) equal expected value (EEV) and (3) unequal expected value
(UEV). In EEV cases, a saccade to either the safe or the risky target
offered the same mean reward, as shown in the first four states
(s) of Table 1. In UEV cases, the mean reward maintained
for the two targets is not the same, as in the last two states (s) of
Table 1. The optimized 5HT parameter (used in Equation
2.8), α, is equal to 1.658 for the RTD condition and is 1.985
for the baseline (control) condition. The optimized β used in
Equation 2.9 is 0.044. Long et al. (2009) demonstrated a significant
reduction in choosing the safe option on lowering the 5HT
levels in the brain. This was seen irrespective of the options possessing
equal or unequal expected value (EEV/UEV). Our simulation
results also showed a trend for the EEV and UEV conditions
(Figure 2: Sims) similar to that of the experimental results [Figure 2:
Expt adapted from Long et al. (2009)]. The classical RL model
would fail to account for such a result in the selection of the safe
option, especially in the EEV case, where that model would predict
equal probability (= 0.5) for selecting both the safe and risky
rewards.

Table 1 | The sample reward schedule adapted from Long et al. (2009).

Serial no.       Safe target (ms)    Risky targets (ms), each
(states, "s")    (r^j)               with probability 0.5 (r^j)
1                150                 125, 175
2                150                 100, 200
3                150                 50, 250
4                140                 40, 240
5                200                 40, 240
6                210                 40, 240

FIGURE 2 | Comparison between the experimental and simulated
results for the (A) overall choice, (B) unequal EV, and (C) equal EV cases, under
RTD and baseline (control) conditions. Error bars represent the SE with
size N = 100. For every condition, the null hypothesis of no difference
between the means of the experiment (Expt) and the simulation (Sims)
was not rejected (P > 0.05). Here the experimental
results are adapted from Long et al. (2009).
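
The equal-expected-value argument can be made explicit with a short derivation. It is a sketch under the simplifying assumption that learning has converged, so that both targets share the same action value Q while their risk estimates h_safe and h_risky differ; the subscripts "safe" and "risky" are notation introduced only for this illustration.

```latex
\begin{aligned}
U_{\mathrm{safe}} &= Q - \alpha\,\mathrm{sign}(Q)\sqrt{h_{\mathrm{safe}}}, \qquad
U_{\mathrm{risky}} = Q - \alpha\,\mathrm{sign}(Q)\sqrt{h_{\mathrm{risky}}} \\
P(\mathrm{safe}) &= \frac{\exp(\beta U_{\mathrm{safe}})}{\exp(\beta U_{\mathrm{safe}}) + \exp(\beta U_{\mathrm{risky}})}
 = \frac{1}{1 + \exp\!\big(\beta\,(U_{\mathrm{risky}} - U_{\mathrm{safe}})\big)} \\
 &= \frac{1}{1 + \exp\!\big(-\alpha\beta\,\mathrm{sign}(Q)\,(\sqrt{h_{\mathrm{risky}}} - \sqrt{h_{\mathrm{safe}}})\big)}
\end{aligned}
```

Setting α = 0 recovers the classical softmax over action values, for which P(safe) = 1/2 whenever the two values are equal. For α > 0 and h_risky > h_safe, the probability moves above 1/2 when Q > 0 and below 1/2 when Q < 0.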

TIME SCALE OF REWARD PREDICTION AND SEROTONIN 

Experiment summary 

In this section, we show using the model of Section Methods that
the α parameter that represents 5HT is analogous to the time-scale
of reward integration (γ as in Equation 2.2) as described
in the experiment of Tanaka et al. (2007). In order to verify the
hypothesis that 5HT corresponds to the discount factor, γ (as in
Equation 2.4), Tanaka et al. (2007) conducted an experiment in
which subjects performed a multi-step delayed reward choice task
in an fMRI scanner. Subjects had to choose between a white
square leading to a small early reward and a yellow square leading
to a large but delayed reward (Tanaka et al., 2007). They were
tested in: (1) tryptophan depleted, (2) control and (3) excess tryptophan
conditions. At the beginning of each trial, subjects were
shown two panels, consisting of white and yellow squares,
respectively. The two panels were occluded by variable numbers
of black patches. When the subjects selected any one of the panels,
a variable number of black patches was removed from the
selected panel. When either panel was completely exposed, reward
was provided. One of the panels (yellow) provided a larger reward
with greater delay; the other (white) delivered a smaller reward
but after a shorter delay. A total of 8 trials were presented to
each subject and the relative time delay ranges set for the white
and the yellow panels were (3.75~11.25 s, 15~30 s) in four trials,
(3.75~11.25 s, 7.5~15 s) in two trials, and (1.6~4.8 s, 15~30 s)
and (1.6~4.8 s, 7.5~15 s) in one trial each.

Simulation 

We modeled the above task with the state variable, s, representing
the number of black patches in a panel and action, a, as choosing
any one of the panels. Each simulation time step equals one experimental
time step of 2.5 s. The initial numbers of black patches on
the white and yellow panels are 18 ± 9 and 72 ± 24, respectively.
The number of patches removed varied between trials, and is
given for the white panel and the yellow panel as follows (Tanaka
et al., 2007): (S_s, S_l) = (6 ± 2, 8 ± 2) in 4 trials, (6
± 2, 16 ± 2) in 2 trials, and (14 ± 2, 8 ± 2) and (14 ± 2, 16 ±
2) in the remaining 2 trials. The above 8 trials are
repeated for all three tryptophan conditions viz. depleted, control
and excess. Finally, the reward associated with the white panel
is r = 1 and that with the yellow panel is r = 4. Since there is a delay in
receiving the reward, the TD error formulation in Equation
3.3.2.1 is used for updating the value of the states (denoting the
discounted expectation of reward from a particular number of
patches in a panel). The action of removing certain patches from
a panel leads to another resultant state with a reduced
number of patches. Hence at any particular "t" the resultant states
of the white and yellow panels are compared for action selection.
While the value function is updated using Equation 3.3.2.2, the
risk function is updated as in Equations 3.3.2.3, 3.3.2.4. The agent
is then made to choose between the utility functions given by
Equation 3.3.2.5 of both the panels at time, t. Eventually the panel
that is completely exposed is labeled as selected for a particular
trial.

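$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}) - Q_t(s_t) \qquad (3.3.2.1)$$

$$Q_{t+1}(s_t) = Q_t(s_t) + \eta_Q \delta_t \qquad (3.3.2.2)$$

$$h_{t+1}(s_t) = h_t(s_t) + \eta_h \xi_t \qquad (3.3.2.3)$$

$$\xi_t = \delta_t^{2} - h_t(s_t) \qquad (3.3.2.4)$$

$$U_t(s_t) = Q_t(s_t) - \alpha \, \mathrm{sign}(Q_t(s_t)) \sqrt{h_t(s_t)} \qquad (3.3.2.5)$$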
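
To illustrate how the discount factor shapes the state values in Equations 3.3.2.1 and 3.3.2.2, the following Python sketch runs the value update on a simplified deterministic chain. It is not the task simulation itself: the fixed number of steps to each reward, the function name `learn_chain_values`, and the omission of the risk term are assumptions made to isolate the role of γ.

```python
def learn_chain_values(n_steps, reward, gamma, eta_Q=0.1, n_episodes=500):
    """Value learning on a deterministic chain with n_steps transitions, reward on the last one.

    Q[k] is the value of the state reached after k transitions; the terminal value stays at 0.
    """
    Q = [0.0] * (n_steps + 1)  # last entry is the terminal state
    for _ in range(n_episodes):
        for s in range(n_steps):
            r_next = reward if s == n_steps - 1 else 0.0
            delta = r_next + gamma * Q[s + 1] - Q[s]   # Equation 3.3.2.1
            Q[s] += eta_Q * delta                      # Equation 3.3.2.2
    return Q


# Hypothetical delays: small reward (r = 1) after 3 steps, large reward (r = 4) after 9 steps.
for_low_gamma = (learn_chain_values(3, 1.0, 0.5)[0], learn_chain_values(9, 4.0, 0.5)[0])
for_high_gamma = (learn_chain_values(3, 1.0, 0.9)[0], learn_chain_values(9, 4.0, 0.9)[0])
```

In this toy setting the start-state value converges to the reward multiplied by γ raised to the number of steps minus one, so the early small reward has the larger value at γ = 0.5 and the delayed large reward has the larger value at γ = 0.9.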

Results 

In Figure 3A, for sample values of γ = (0.5, 0.6, 0.7) used in
Equation 3.3.2.1, the probability of selecting the larger reward is
plotted as a function of α. Note that for constant γ, the probability
of selecting the delayed reward increases with α. The β used to
generate Figure 3 is 20. The change of the value (Q) and risk function
(h) as a function of the states, s (# of black patches) of each
panel is shown in the Supplementary material for various values of
γ. If α is interpreted as the 5HT level, delayed deterministic reward
choices are favored at higher 5HT levels. Thus α in our model
effectively captures the role of γ in the experiment of Tanaka
et al. (2007) for functionally representing the action of 5HT in
the striatum of the BG. In addition, a trend of increasing differences
between the utilities of the yellow and the white panels as a function
of the state, s_t, could be seen on increasing the value of α
(Figure 3B). This is similar to the increasing differences of value
functions for states, s_t, between the yellow and white panels on
increasing the value of γ (Figure 3B, Supplementary material).
These differences in values/utilities are of prime importance
for deciding the exploration/exploitation type of behavior by any
policy such as that in Equation 2.9.

Our goal in the Section Time Scale of Reward Prediction
and Serotonin is to relate our model's serotonin correlate (α in
Equation 2.8) to that proposed in the experiment of Tanaka et al.
(2007) (γ as in Equation 2.2) in the striatum. The differential activity
of the striatum observed in fMRI of the subjects in different tryptophan
conditions was indeed modeled in Tanaka et al. (2007)
via value functions (Equations 2.2-2.3) with different γ values.
Specifically, the value generated by a lower (higher) γ value better
modeled the striatal activity following tryptophan depletion
(excess tryptophan). An increase in γ results in a value distribution
which, when expressed with a particular value of β (Equation
2.9), would increase the probability of selecting the delayed but
larger rewards (Sutton and Barto, 1998). Note that the subjects in
Tanaka et al. (2007) show no great preference for one action over
the other, though the striatal activity levels in subjects show sensitivity
to γ values. This could be because action selection is not
singularly influenced by the striatum and is probably influenced
by downstream structures like GPi (Globus Pallidus interna),
or parallel structures like STN (SubThalamic Nucleus) and GPe
(Globus Pallidus externa) (Chakravarthy et al., 2010). Doya
(2002) suggested that the randomness in action selection, which
has been parametrized by β (Equation 2.9) in RL models, can be
correlated with the effect of norepinephrine on the Pallidum. Thus
for sufficiently small β, it is possible to obtain equal probability
of action selection, though the corresponding utilities might be
sufficiently different. The focus of this section is to draw analogies
between the discount parameter γ of classical RL models and the
α parameter in our utility-based model, as substrates for 5HT
function in the striatum.

REWARD/PUNISHMENT PREDICTION LEARNING AND SEROTONIN 

Experiment summary 

The ability to differentially learn and update action selection from
reward and punishment feedback is shown to change on altering
the tryptophan levels in subjects. We model a deterministic
reversal learning task (Cools et al., 2008; Robinson et al., 2012)
in which the subjects were presented with two stimuli, one associated
with reward and the other with punishment. On each
trial, the subjects had to predict whether the highlighted stimulus
would lead to a reward or a punishment response. The subjects were
tested at either balanced or depleted tryptophan levels (drink),
on their association of the stimulus to the corresponding action
at any time. Erroneous trials were followed by the same stimulus
till it was predicted correctly by the subject, and the same is
adopted in the simulations too. Trials were grouped into blocks.
Each subject performed 4 experimental blocks, which were preceded
by a practice block in order to familiarize the subject with
the task. Each experimental block consisted of an acquisition stage
followed by a variable number of reversal stages. One of two
possible experimental conditions was applied to each block. The
experimental conditions were: the unexpected reward (punishment)
condition, where a stimulus previously associated with punishment
(reward) becomes rewarding (punishing). Since there are
4 blocks of trials, there were two blocks for each condition.
Performance of the subjects in the non-reversal trials was evaluated
as a function of (a) drink and condition (unexpected
reward/unexpected punishment), and (b) drink and outcome
(reward/punishment) trial type. Results showed that performance
did not vary significantly with condition in both balanced and
tryptophan depleted cases. Errors were fewer for tryptophan
depleted cases than balanced cases in both conditions. Specifically,
errors were fewer for punishment-prediction trials compared
to reward-prediction trials in tryptophan-depleted cases. Thus
the experiment suggests that tryptophan-depletion selectively
enhances punishment-prediction relative to reward-prediction.
Please refer to Cools et al. (2008) for a detailed explanation of the
experimental setup and results.

FIGURE 3 | (A) Selection of the long term reward as a function of α.
Increasing γ increased the frequency of selecting the larger and more
delayed reward. Increasing α also gave similar results for a fixed γ.
(B) Differences in the utilities (U) between the yellow and white panels
averaged across trials for the states, s_t, as a function of γ and α. Here
N = 2000.

Simulation 

We model the two stimuli as states, s (s ∈ {s_1, s_2}), and the
response of associating a stimulus with reward or punishment as the
action, a (a ∈ {a_1 = reward, a_2 = punishment}). At any
particular trial, t, the rewarding association is coded by r_t = +1,
and the punitive association is coded by r_t = -1. This is treated
as an immediate reward problem and the TD error calculation
in Equation 2.5 is used. As in the experiments, three types of
trials are simulated: non-reversal trials, in which the
association of a stimulus-response pair is learnt; reversal trials,
in which the change of the learnt association is triggered; and
switch trials, in which the reversed associations are tested
following the reversal trials. The setup is similar to that of the
experiment: the maximum number of reversal stages per experimental
block is 16, with each stage continuing until the number of correct
responses falls in the range 5-9. The block terminates automatically
after 120 trials. There are two blocks in each condition, and
hence a total of 480 trials (4 blocks) are conducted per agent. The
design of the experiment has an inbuilt complementarity in the
association of the actions with a particular stimulus (increasing the
action value of a_1 for a stimulus, s, decreases that of a_2 for s)
and in that of the stimuli with a particular action (increasing the
action value of s_1 for an action, a, decreases that of s_2 for a).
Hence, in the simulations, the action values [Q_t(s_t, a_t) as in
Equation 2.3] associated with the two actions [Q(s, a_1) and
Q(s, a_2)] for any particular state "s" are simulated to be
complementary (Equation 3.4.2.1) at any trial "t."

Q(s, a_1) = -Q(s, a_2) (3.4.2.1)

The action values of the two stimuli, s, [Q(s_1, a) and Q(s_2, a)]
mapped to the same action, a, are also complementary (Equation
3.4.2.2) at any trial "t."

Q(s_1, a) = -Q(s_2, a) (3.4.2.2)

Hence, only one of the four value functions [Q(s_1, a_1), Q(s_1,
a_2), Q(s_2, a_1), Q(s_2, a_2)] is learnt by training, while the other
three are set by the complementarity rules to capture the experimental
design. We assume that such a complementarity could be
learnt during the initial practice block that facilitated familiarity.
The action (response) is selected by executing the policy of
Equation 2.9, with β set to its optimized value of 10, on
the utilities (Equation 2.8) of the two responses (a) for any given
stimulus (s) at a trial (t). The risk functions for the same are given
by Equation 2.6.
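
The complementarity rules and the response selection described above can be illustrated with a minimal Python sketch. It assumes that the utility of Equation 2.8 has the form U = Q - α sign(Q) √h and that the policy of Equation 2.9 is a softmax with inverse temperature β; neither equation is reproduced in this section, so both forms are assumptions here. The function names and the use of one shared risk estimate h for all stimulus-response pairs are hypothetical simplifications, not the implementation used for the reported simulations.

```python
import math

ACTIONS = ("a1", "a2")  # a1 = predict reward, a2 = predict punishment


def q_table(q11):
    """Expand the single learnt value Q(s1, a1) into all four action values."""
    return {
        ("s1", "a1"): q11,
        ("s1", "a2"): -q11,  # Equation 3.4.2.1: Q(s, a1) = -Q(s, a2)
        ("s2", "a1"): -q11,  # Equation 3.4.2.2: Q(s1, a) = -Q(s2, a)
        ("s2", "a2"): q11,   # implied by applying both rules together
    }


def utility(q, h, alpha):
    """Assumed form of Equation 2.8: Q - alpha * sign(Q) * sqrt(h)."""
    sign = (q > 0) - (q < 0)
    return q - alpha * sign * math.sqrt(h)


def choice_probabilities(state, q11, h, alpha, beta=10.0):
    """Assumed softmax policy over the utilities of the two responses."""
    q = q_table(q11)
    u = {a: utility(q[(state, a)], h, alpha) for a in ACTIONS}
    u_max = max(u.values())  # subtract the maximum for numerical stability
    w = {a: math.exp(beta * (u[a] - u_max)) for a in ACTIONS}
    total = sum(w.values())
    return {a: w[a] / total for a in ACTIONS}


if __name__ == "__main__":
    # Illustrative values only; h is a single shared risk estimate (assumption).
    print(choice_probabilities("s1", q11=0.8, h=0.04, alpha=0.5))
    print(choice_probabilities("s2", q11=0.8, h=0.04, alpha=0.5))
```

With a positive Q(s_1, a_1), the sketch favors a_1 for s_1 and a_2 for s_2, which is the mirror-image behavior that the two complementarity rules are meant to capture.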

Results 

In the non-reversal trials, the errors with respect to the
drink and the condition (viz., unexpected reward and unexpected
punishment) are shown in Figure 5. The errors with respect
to the drink and the outcome (viz., reward and punishment prediction
errors) in both conditions are shown in Figure 4. Our
results (Figure 4: sims values) show that the reward prediction
error in the simulations does not vary much from the balanced
(optimized α = 0.5, representing control tryptophan) condition
to the tryptophan-depleted (optimized α = 0.3)
condition, but the punishment prediction error decreases, thereby
matching the experimental results [Figure 4: expt values adapted
from Cools et al. (2008)]. The errors in unexpectedly rewarding
and unexpectedly punitive trials are found to be the same in both
the balanced and tryptophan-depleted cases (Figure 5: sims values),
again matching the experiment [Figure 5: expt values adapted
from Cools et al. (2008)]. Therefore, the higher 5HT levels of the
balanced condition appear to promote the inhibition of responses to
punishing outcomes, as proposed by Cools et al. (2008). Reducing
5HT via tryptophan depletion then removes this inhibition. A
similar result can be seen in Figures 4, 5, which depict the balanced
(α = 0.5) and the tryptophan-depleted (α = 0.3) conditions.
The sign(Q_t) term in Equation (3.3.2.5) plays a crucial role in this
differential response to gains (rewards) and losses (punishments)
(an analysis of the results on removing the sign(Q_t) term is provided
in the Supplementary material). As the data are in the form of counts,
the errors are reported as SQRT(error counts) (Cools et al., 2008)
in Figures 4, 5.
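
The square-root transformation of count data mentioned above can be sketched in a few lines of Python. The function name is hypothetical, the counts in the usage example are made up for illustration and are not results from the simulations or from Cools et al. (2008), and computing the standard error on the transformed values is an assumption about the reporting procedure.

```python
import math
import statistics


def sqrt_error_summary(error_counts):
    """Return the mean and standard error of square-root-transformed counts."""
    transformed = [math.sqrt(c) for c in error_counts]
    mean = statistics.mean(transformed)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(transformed) / math.sqrt(len(transformed))
    return mean, sem


if __name__ == "__main__":
    # Made-up error counts for four hypothetical agents, for illustration only.
    illustrative_counts = [1, 4, 9, 16]
    print(sqrt_error_summary(illustrative_counts))
```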

DISCUSSION 

MAIN FINDINGS OF THE MODEL 

The Reinforcement Learning framework has been used extensively
to model the function of the basal ganglia (Frank et al., 2007;
Chakravarthy et al., 2010; Krishnan et al., 2011; Kalva et al.,
2012). The starting point of our model was to understand the
contributions of serotonin to BG function (Tanaka et al., 2009;
Boureau and Dayan, 2011). We use the notion of risk, since
serotonin is shown to be associated with risk sensitivity. Some
instances are as follows. When choices with risky and safe rewards
are presented, the reduction of central serotonin levels favors the
selection of risky choices relative to baseline levels (Long et al.,
2009). The non-linearity in risk-based decision making (risk
aversion in the case of gains and risk seeking in the case of
losses) is postulated to be affected by central serotonin levels
(Murphy et al., 2009). Negative affective behavior such as
depression and anxiety, and other behavior such as impulsivity,
caused by the reduction of central serotonin levels, is argued to be
a risky choice selection in a risk-based decision making framework
(Dayan and Huys, 2008). Based on the putative link between serotonin
function and risk sensitivity, we have extended the classical RL
approach by executing the policy on the utility function
(Equation 2.8) instead of the value function. The utility function
combines the value function with the risk function. We propose that
the weightage (α) that combines value and risk in the utility
function represents serotonin (5HT) functioning in the BG. Using
this formulation, we show that three different experimental
paradigms instantiating diverse theories of serotonin function in
the BG can be explained under a single framework.

The proposed model is applied to different experimental
paradigms. The first is a bee foraging task in which bees choose
between yellow and blue flowers based on the associated risk
(Real, 1981). The proposed model is applied to this simple
instance of risk-based decision making, though the experiment
does not particularly relate to DA and 5HT signaling. The risk
sensitivity reported in the bee foraging experiment is predicted
accurately by our model (for α = 1).

FIGURE 4 | The mean number of errors in non-switch trials. (A) Errors
as a function of "α" and outcome trial type; "α = 0.5" (balanced) and
"α = 0.3" (tryptophan depletion). Error bars represent standard errors of
the difference as a function of "α" in simulation for size "N" = 100 (Sims).
(B) Experimental error percentages adapted from Cools et al. (2008). Error
bars represent standard errors as a function of drink in experiment (Expt). The
results in (B) were reported after the exclusion of the trials from the
acquisition stage of each block.

FIGURE 5 | The mean number of errors in non-switch trials as a
function of condition. Simulation (Sims): "α = 0.5" (balanced) and
"α = 0.3" (tryptophan depletion). Experimental (Expt) results adapted
from Cools et al. (2008). Error bars represent standard errors either as a
function of drink in experiment, or of α in simulation for size "N" = 100.

Next we model experiments dealing with various functions
of 5HT. One such experiment links 5HT levels to risky behavior.
Experiments by Long et al. (2009) and Murphy et al. (2009)
associate 5HT levels with non-linear risk sensitivity in
gains and losses. In our study, we model a classic experiment
by Long et al. (2009) describing the risk sensitivity of monkeys
on depleting the 5HT level. With our model, the increased
risk-seeking behavior in the RTD condition is captured with parameter
α = 1.658, and the baseline condition with α = 1.985. This
result shows that our model's 5HT-correlate "α" can control risk
sensitivity.
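
How a lower α could translate into more risk seeking can be sketched in Python under stated assumptions. The utility is assumed to take the form U = Q - α sign(Q) √h (our reading of Equation 2.8), and the choice rule is assumed to be a two-option softmax. The option values, the risk of the risky option and β are hypothetical and are not the task parameters of Long et al. (2009); only the two α values are taken from the text.

```python
import math


def utility(q, h, alpha):
    """Assumed form of Equation 2.8: Q - alpha * sign(Q) * sqrt(h)."""
    sign = (q > 0) - (q < 0)
    return q - alpha * sign * math.sqrt(h)


def p_risky(alpha, q=1.0, h_risky=0.25, beta=1.0):
    """Probability of picking the risky option over a safe one of equal value."""
    u_safe = utility(q, 0.0, alpha)       # safe option: zero reward variance
    u_risky = utility(q, h_risky, alpha)  # risky option: same mean, risk h_risky
    return 1.0 / (1.0 + math.exp(beta * (u_safe - u_risky)))


if __name__ == "__main__":
    # alpha values reported in the text; all other parameters are illustrative.
    for label, alpha in (("RTD", 1.658), ("baseline", 1.985)):
        print(label, round(p_risky(alpha), 3))
```

In this toy setting the risky option is chosen more often for α = 1.658 than for α = 1.985, in line with the direction of the reported effect.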

The third experiment is a reward prediction problem (Tanaka
et al., 2009) associating 5HT with the time scale of prediction.
Herein the subjects chose between a smaller short-term reward
and a larger long-term reward. Our modeling results show that
for a fixed γ, increasing α increases the probability of choosing
the larger, long-term reward. Since a higher α denotes a higher 5HT
level, the model corroborates the experimental result, suggesting
that our model's 5HT-correlate "α" behaves similarly to the time
scale of reward prediction.

Finally, the fourth experiment shows the differential effect
of 5HT on the sensitivity to reward and punishment prediction
errors. Under conditions of balanced 5HT (α = 0.5), the
model is less sensitive to punishment and commits more errors
in predicting punishment; this trend is rectified in the depleted 5HT
(α = 0.3) condition. For the numerical analysis of reward and
punishment prediction error, the experiment by Cools et al. (2008)
did not take the acquisition trials into consideration. However,
these trials serve to learn the initial association between stimulus
and response. They also act as a base for the forthcoming reversal
and switch trials and are hence included in the analysis of our
simulation. This differential effect shown by the model's 5HT-correlate
"α" toward punishment corroborates the experimental evidence
linking 5HT to adverse behavior exhibited in psychological disorders
like depression and anxiety (Cools et al., 2008, 2011; Boureau
and Dayan, 2011).

Simulation results thus show that the proposed model of 5HT 
function in BG reconciles three diverse existing theories on the 
subject: (1) risk-based decision making, (2) time-scale of reward 
prediction and (3) punishment sensitivity. To our knowledge this 
is the first model that can reconcile the diverse roles of serotonin 
under a simple and single framework. 

SIGNIFICANCE OF SIGN(Q_t)

The sign(Q_t) term presented in the modified formulation of the
utility function (Equation 2.8) denotes the preference for risk in a
given context of the experiment. At high mean reward values
humans are found to be risk-averse, whereas at low mean reward
values they are risk-seeking (Kahneman and Tversky, 1979). In
neuroeconomic experiments, this risk preference is statistically
determined, for example, by maximizing the log likelihood of
the decisions (D'acremont et al., 2009). Though this method
estimates the risk preference subjectively, it is derived from decisions
made throughout the experiment. The use of sign(Q_t) in our
model takes into account the variation of the subjective risk
preference, according to the expected cumulative reward outcomes
observed within an experiment. The significance of this term
in the formula of modified utility (Equation 2.8) can be seen
from the Supplementary material. This Supplementary material
presents the results of simulating the experiment by Cools et al.
(2008) with an altered model having no sign(Q_t) term in the
utility function of Equation (2.8). In that case, the mean number of
errors does not vary as a function of either trial type or condition
for different values of "α," contrary to what happens in the experiment.
Thus the sign(Q_t) term is essential for simulating the results of Cools
et al. (2008). Such nonlinear risk sensitivity has been
shown to be modulated by 5HT in various experiments (Long
et al., 2009; Murphy et al., 2009), which further strengthens our
proposal of introducing the term sign(Q_t) in Equation (2.8).
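
The role of the sign(Q_t) term can be made concrete with a small worked example in Python. It assumes that Equation 2.8 has the form U = Q - α sign(Q) √h and that the altered model without the sign term reduces to U = Q - α √h; the latter is our assumption about the supplementary analysis, and the function names and numerical values are hypothetical.

```python
import math


def utility(q, h, alpha, use_sign=True):
    """Assumed Equation 2.8; use_sign=False mimics a model without sign(Q)."""
    sign = (q > 0) - (q < 0) if use_sign else 1
    return q - alpha * sign * math.sqrt(h)


def prefers_risky(q, alpha=0.5, h_risky=1.0, use_sign=True):
    """True if the risky option has higher utility than a safe one of equal value."""
    u_risky = utility(q, h_risky, alpha, use_sign)
    u_safe = utility(q, 0.0, alpha, use_sign)
    return u_risky > u_safe


if __name__ == "__main__":
    print("gains, with sign:", prefers_risky(+1.0))
    print("losses, with sign:", prefers_risky(-1.0))
    print("losses, without sign:", prefers_risky(-1.0, use_sign=False))
```

Under these assumptions the sign term makes the agent risk-averse for gains (Q > 0) and risk-seeking for losses (Q < 0), whereas the variant without it is risk-averse in both cases and so cannot respond differentially to rewards and punishments.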

5HT-DA INTERACTION IN THE "RISK" COMPONENT OF DECISION 
MAKING 

The risk part of the utility function (Equation 2.8) has three
components: α, sign(Q_t) and √h_t. While "α" represents 5HT,
the remaining two components are dependent on "δ" or DA.
Thus the proposed model of risk computation postulates a complex
interaction between DA and 5HT. In neurobiology, complex
interactions are indeed seen to exist between DA and 5HT (Di
Matteo et al., 2008a,b) at the cellular level, though these are not
detailed in the present abstract model. The 5HT afferents from the
dorsal raphe nucleus differentially modulate the DA neurons in the SNc
and ventral tegmental area (VTA) (Gervais and Rouillard, 2000).
The 5HT projections act via specific receptor subtypes on the DA
neurons. 5HT1A, 5HT1B, 5HT2A, 5HT3 and 5HT4
agonists facilitate dopaminergic release, whereas 5HT2C agonists
inhibit it. Selective serotonin reuptake inhibitors are
known to reduce the spontaneous activity of DA neurons in the VTA
(Di Mascio et al., 1998; Alex and Pehek, 2007; Di Giovanni et al.,
2008). The 5HT neurons in the dorsal raphe nucleus also receive
dense DA innervation from midbrain DA neurons (Ferre et al.,
1994) and express D2R (Suzuki et al., 1998).

CONTRIBUTIONS FROM EXISTING MODELS 

Previous models of 5HT tend to focus on individual functions
of 5HT in isolation, without reconciling them in a single
framework. Most of them consider 5HT as a neuromodulator
mediating aversive outcomes (Daw et al., 2002; Boureau and
Dayan, 2011; Cools et al., 2011). Some describe 5HT as a controller
of the time scale in the prediction of rewards (Tanaka et al., 2007),
and as a modulator that associates aversive outcomes with past
actions (Tanaka et al., 2009). Psychological disorders associated
with lowered 5HT levels, such as impulsivity and negative moods,
have also been studied by the existing models of 5HT. They attribute
impulsivity to increased short-term reward prediction
(Tanaka et al., 2007), and negative moods to increased
punishment sensitivity (Cools et al., 2011; Robinson
et al., 2012). Such observations may then be captured in our model
by assessing the risk involved in the task and by controlling the
"α" (5HT) parameter.

STUDY PREDICTIONS AND FUTURE WORK 

Our proposed unified model is an abstract mathematical model,
aimed at explaining a range of behavioral effects of 5HT. It is
only a preliminary model that uses a modified RL framework and
explains the role of 5HT and DA in the BG. It focuses mainly
on risk computation and the role of the nigrostriatal DA signal in
shaping the learning of risk and value in the BG. Ideally, a
convincing model of utility computation in the BG should go beyond the
5HT-DA interaction in the abstract representation of the value
and the risk quantities and demonstrate how the utility computation
would be carried out by the neurobiological correlates
in the BG.

In classical Actor-Critic approaches to modeling BG function, 
value computation is thought to occur in striatum (Joel et al., 
2002). There is evidence from functional imaging that supports 
this theory (O'doherty et al., 2006). There is strong evidence 
for the existence of DA-modulated plasticity in corticostriatal 
connections, an effect that is necessary to account for value com- 
putation in the medium spiny neurons (MSNs) of striatum (see 
review by Kotter and Wickens, 1998). The idea that MSNs are 
probably cellular substrates for value computation has found its 
place in recent modeling literature (Morita et al., 2012). 

Starting from the fact that the effect of DA on the
D1-expressing MSNs of the striatum is to increase the firing rate
(by having an increasing gain as a function of δ), it has been
shown in a computational model of the BG that these D1-expressing
MSNs are capable of computing value (Krishnan et al., 2011).
Just as D1R-expressing MSNs are thought to be cellular substrates
for value computation in the striatum (Kotter and Wickens,
1998; O'doherty et al., 2006; Krishnan et al., 2011; Morita et al.,
2012), we propose that D1D2-coexpressing MSNs can be the
cellular correlates for risk computation. We have already developed
a network model of the BG in which risk is computed by
D1D2-coexpressing neurons in the striatum, while value is computed
by D1-expressing medium spiny neurons (unpublished).
Just as neurons that compute the value function (Equations 2.3-2.4)
require a monotonically increasing gain as a function of δ in
the MSNs, the risk function (Equations 2.6-2.7) would require a
"U-shaped" gain as a function of δ. It is plausible
that such risk-type gain functions would be
exhibited by neurons that coexpress both the D1-like gain function,
which increases with δ, and the D2-like gain function, which
decreases with δ (Servan-Schreiber et al., 1990; Moyer et al.,
2007; Thurley et al., 2008; Humphries et al., 2009). Interestingly,
about 59% of neurons in the Globus Pallidus and 20-30% in the
ventral striatum coexpress D1R and D2R (Perreault et al., 2010).
Even among the MSNs of the striatum, the proportion of D1R-D2R
co-expressing neurons is greater in ventral striatal MSNs
(17% in shell) than in the dorsal striatum (5%) (Surmeier
et al., 1996; Bertran-Gonzalez et al., 2008). Some studies also
point out that around 70% of the MSNs in the striatum coexpress
the D1 and the D2 type receptors (Surmeier et al., 1996). The
ventral striatum also mediates risk sensitivity in action selection
(Stopper and Floresco, 2011), the latencies of response,
and the sensitivity to the magnitude of the rewards (Acheson
et al., 2006; Floresco et al., 2006). This encourages us to predict
a link between the risk-based functioning of the ventral striatum
and the significant presence of D1R-D2R co-expressing
neurons there. In the future, we would also like to explore whether
D1R-D2R co-expressing neurons plausibly contribute to the
computation of the risk function, and how serotonin selectively
modulates these risk-computing neurons. We therefore predict
that selective loss of these co-expressing neurons would
make the subject less sensitive to the risk component of the
environment.
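
The idea that a "U-shaped" gain could arise from combining a D1-like and a D2-like gain can be illustrated with a toy Python sketch. The sigmoidal gain functions, their slope and threshold values, and the choice of simply summing the two gains are all hypothetical assumptions made for illustration; they are not taken from the unpublished network model mentioned above.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def d1_gain(delta, slope=4.0, threshold=1.0):
    """D1-like gain: increases monotonically with the TD error delta."""
    return sigmoid(slope * (delta - threshold))


def d2_gain(delta, slope=4.0, threshold=1.0):
    """D2-like gain: decreases monotonically with the TD error delta."""
    return sigmoid(-slope * (delta + threshold))


def coexpressing_gain(delta):
    """Hypothetical D1R-D2R co-expressing gain: the sum of the two gains."""
    return d1_gain(delta) + d2_gain(delta)


if __name__ == "__main__":
    for delta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(delta, round(coexpressing_gain(delta), 3))
```

The summed gain in this sketch is lowest near δ = 0 and rises for large positive or negative δ, which is the qualitative shape needed by a risk estimate that grows with the magnitude of δ irrespective of its sign.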

The role of serotonin in the reward and punishment sensitivity of
Parkinson's disease (PD) subjects could also be analyzed using our
proposed modeling approach. In experiments where the
reward/punishment sensitivity of PD subjects was studied, PD patients
ON DA medication showed increased reward sensitivity compared to PD
OFF subjects, who showed increased punishment sensitivity (Frank et al.,
2007; Bodi et al., 2009). Our proposed model, in which serotonin
controls the weightage of risk, is expected to account for the
aforementioned experimental results. Preliminary work on the
application of the proposed model to the study of Bodi et al. (2009)
gave encouraging results (unpublished).

In connection with the neurobiological correlate of the
-sign(Q_t) term, the aforementioned discussion suggests a general,
complex interaction between DA and 5HT signals. But as a specific
circuit that can form the basis for the -sign(Q_t) term in
Equation 2.8, we invoke the circuitry that links the habenula with
the striatum. The habenula is a structure that is thought to be
involved in the brain's responses to reward, pain and anxiety
(Lecourtier and Kelly, 2007; Hikosaka, 2010). It has gained
importance for its interactions with the DA and 5HT systems
(Lecourtier and Kelly, 2007; Hikosaka, 2010). It is a small structure
located near the posterior-dorsal-medial end of the thalamus. It is
divided into the medial habenula (MHb) and the lateral habenula (LHb).
The striatum (in particular the D1R-containing striosome) and the LHb
are thought to form a negative feedback loop [LHb → Rostromedial
Tegmental Nucleus (RMTg) → VTA/SNc → Striatum → Globus Pallidus → LHb],
not via direct connections but via intermediaries (Lecourtier and
Kelly, 2007; Hikosaka, 2010). Activation of LHb neurons inhibits
the DA cells of the VTA and SNc. DA is also known to have
a special action on MSNs, as follows. Activation of D1 receptors
is known to enhance (suppress) the activation of MSNs
if the prior membrane state is depolarized (polarized)
(Flores-Hernandez et al., 2002). However, we do not know whether the
action of DA on the hypothesized risk-computing D1-D2 co-expressing
neurons is that of a stabilizer of the pre-existing
state. Based on the data reviewed above, we plan to develop
a model in which D1-expressing MSNs, whose activity represents
value, act on D1R-D2R co-expressing MSNs via the habenula,
through an interaction term that can be roughly described
by -sign(Q_t).

Finally, a theory of 5HT and DA in the BG must go beyond
the striatum, since 5HT innervations in the BG are not confined
to the striatum but include the GPe, SNc, and PPN (Wallman
et al., 2011). We plan to elucidate the role of 5HT and DA
in these other nuclei of the BG through a more complete network
model in our future work. The suggested roles of DA in the
BG include (1) plasticity of corticostriatal connections, (2)
switching between DP and IP by striatal DA, and (3) modulating
the exploratory drive arising from the STN-GPe system
(Chakravarthy et al., 2010; Kalva et al., 2012). Analogously, we plan
to develop a comprehensive theory of 5HT and DA in the BG.
The theory might shed light on the role of 5HT in
some of the key functions of the BG, viz., action selection and
decision making.



SUPPLEMENTARY MATERIAL 

The Supplementary Material for this article can be found
online at: http://www.frontiersin.org/journal/10.3389/fncom.2014.00047/abstract